Achieving high availability requires that risks are identified and addressed.
Many organizations employ risk management practices to capture and
address potential disruptions to business processes. These practices
usually consist of the following phases:
Identification
This phase includes the documentation of areas of risk within the
business. These range from loss of a large customer and the associated
revenue all the way to a disaster that destroys a company datacenter.
Assessment
This phase includes the analysis of the identified risks to determine
the probability and the impact of each.
Mitigation
This phase includes creating a plan for mitigating each potential risk.
The mitigation plans for each risk fall into the following three
categories:
Acceptance
The risk is accepted as it stands, usually because the probability of
occurrence is so low that it doesn't require mitigation or because the
cost of mitigation outweighs the consequences of the risk. An example is
the risk of datacenters that are 20 miles apart being affected by the
same tornado. Although this is possible, the likelihood is so small that
the risk is acceptable.
Transference
The risk is shifted to another party by obtaining insurance or by
outsourcing the affected function to others to manage. An example is
outsourcing anti-spam and antivirus filtering of inbound e-mail to
Microsoft Exchange Hosted Services.
Reduction
The risk is managed to a point where it is less probable or can be
recovered from quickly. An example is deploying a cross-site DAG across
two datacenters to reduce the likelihood that a single site failure
causes a messaging system outage.
Implementation
This phase includes putting the risk mitigation into practice.
Review
This phase evaluates the risk mitigation plan to verify that it has
addressed the identified risks and to evaluate whether any new risks
have been introduced.
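The assessment phase can be sketched in code. The following is a minimal illustration, assuming a simple scoring model in which each risk's score is its probability multiplied by its impact. The `Risk` class, the `rank` function, the rating scales, and every numeric value are hypothetical and chosen only for illustration; they are not taken from any assessment described here.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # assumed yearly likelihood, 0.0 to 1.0 (hypothetical scale)
    impact: int         # assumed severity, 1 (minor) to 5 (severe) (hypothetical scale)

    @property
    def score(self) -> float:
        # Assumed scoring model: probability multiplied by impact.
        return self.probability * self.impact


def rank(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical ratings for three risks from a messaging environment.
risks = [
    Risk("Mailbox server disk failure", 0.30, 3),
    Risk("Site failure", 0.02, 5),
    Risk("Employee mistake", 0.50, 2),
]

ranked = rank(risks)
for r in ranked:
    print(f"{r.name}: {r.score:.2f}")
```

A ranking like this only suggests where to spend mitigation effort first. A risk with a very low score, such as the site failure in this made-up data, may be a candidate for acceptance, but that remains a business decision rather than something the arithmetic decides.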
Not only should risk management be practiced at the business level, but
it must also be performed for IT solutions, such as the Exchange
messaging environment. As you perform risk identification for your
messaging environment, you may list disk failure, server motherboard
failure, loss of Internet connectivity, security breaches, site
failures, and employee mistakes as risks. The assessment and mitigation
process may create a list similar to the one in Table 1.
Table 1. Exchange Risk Mitigation
RISK | MITIGATION |
---|---|
Mailbox Server Disk Failure | Reduction: Use a RAID configuration or rely on DAG replication. |
Server Motherboard Failure | Reduction: Use a DAG for Mailbox servers and deploy multiple Transport and Client Access servers. |
DNS Server Failure | Reduction: Deploy multiple DNS servers and configure servers to use them. |
Domain Controller Failure | Reduction: Deploy multiple domain controllers in each site. |
Network Device Failure | Reduction: Deploy redundant network devices. |
Loss of Internet connectivity | Reduction: Add additional Internet providers. Transference: Host servers in a colocation facility. |
Security Breaches | Reduction: Good update management; implement intrusion detection and prevention systems. Transference: Outsource security to an experienced third-party provider. |
Site Failures | Reduction: Deploy a failover site. |
Employee Mistakes | Reduction: Provide training for employees and automate many common tasks. |
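A register like Table 1 is easier to review when it is kept as data. The sketch below transcribes the risks and their mitigation categories from Table 1 into a Python dictionary and groups them by category. The names `TABLE_1` and `risks_by_strategy` are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict

# Risks and mitigation categories transcribed from Table 1.
TABLE_1 = {
    "Mailbox Server Disk Failure": ["Reduction"],
    "Server Motherboard Failure": ["Reduction"],
    "DNS Server Failure": ["Reduction"],
    "Domain Controller Failure": ["Reduction"],
    "Network Device Failure": ["Reduction"],
    "Loss of Internet connectivity": ["Reduction", "Transference"],
    "Security Breaches": ["Reduction", "Transference"],
    "Site Failures": ["Reduction"],
    "Employee Mistakes": ["Reduction"],
}


def risks_by_strategy(table):
    """Group risk names under each mitigation category that applies to them."""
    grouped = defaultdict(list)
    for risk, strategies in table.items():
        for strategy in strategies:
            grouped[strategy].append(risk)
    return dict(grouped)


grouped = risks_by_strategy(TABLE_1)
for strategy in ("Acceptance", "Transference", "Reduction"):
    print(strategy, len(grouped.get(strategy, [])))
```

Grouping this way makes it easy to see during the review phase that every risk in Table 1 has a reduction measure, two also have a transference option, and none has been explicitly accepted.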
One of the best ways to mitigate risk is to periodically test any
disaster avoidance or recovery practices that have been put into place.
Testing allows these measures to be exercised and refined in a
controlled environment, which in the end reduces risk. Small details
that are overlooked in a plan often cause delays in the recovery. For
example, some organizations colocate the primary datacenter in the same
facility as the office space. If the primary facility is no longer
viable and the IT systems are operational in the secondary datacenter,
the users will still need another location to work. The processes and
procedures for accessing the new location and notifying customers must
also be worked out.
These fire drills also provide the opportunity to teach employees the
importance the business places on recovery and to reinforce the mind-set
of working toward that goal during all of their day-to-day
responsibilities.